STRUCTURE EXTRACTION ON ELECTRONIC DOCUMENTS

BACKGROUND OF THE INVENTION

The present invention relates to techniques that identify and
categorize paragraphs, subparagraphs, and structural groupings in
electronic documents, and more particularly to techniques that build a
structure hierarchy from structural groupings.

An electronic document typically has information content, such as
text, graphics, and tables, and formatting information that directs how
the content is to be displayed. An electronic document resides on a
digital, though not necessarily electronic, computer storage medium. An
electronic document is generally provided by an author, distributor, or
publisher who desires that the document be viewed with the appearance
with which it was created. Electronic documents may be widely
distributed and, therefore, can be viewed on a great variety of hardware
and software platforms. A hypertext document is an electronic document
with links, which are explicit, user-selectable navigation elements.

Generally, electronic and human perceptible documents include a set of
paragraphs. Each instance of a paragraph shares characteristics with
other paragraphs. Paragraphs that share visual characteristics can be
considered the same structural type. Examples of structural paragraph
types are titles, headers, and footnotes.

In addition, in all documents, paragraphs can have subparagraphs, which
are character streams. Each instance of a subparagraph shares similar
characteristics with other subparagraphs that are the same structural
type. Examples of subparagraph structural types are book titles,
quotations, and foreign words and phrases.

A document typically has a logical organization. Within the logical
organization are identifiable structural groups. A series of chapters
containing paragraphs is an example of a structural group, as is a
section that contains a heading, several paragraphs, and a bulleted
list.

Organizing components in an electronic document by structural type
permits an electronic document development system to perform global
operations on all instances of the same type within the electronic
document. For example, the FrameMaker® document publishing system,
available from Adobe Systems Incorporated of San Jose, Calif., can
globally change the justification of all paragraphs tagged as a
particular type in the electronic document and can globally change the
font size of all characters tagged as a particular type in the
electronic document.

Standard type formats exist for particular uses and for particular
systems. For example, the HyperText Markup Language (HTML) uses the
embedded tags <P> and </P> to delimit paragraphs, and <B> and </B> to
delimit bold text. HTML also specifies many other tags including tags
for titles, menus, definitions, quoted blocks, and heading styles. For
an electronic document to have the desired visual appearance when viewed
with a World Wide Web browser, the electronic document must have the
appropriate HTML tags.

When viewed on paper or on a computer display, the different structural
paragraph types in a document, such as headings and lists, are readily
identifiable. However, to enable a system to perform operations based on
structural types, such as modifying, rearranging, displaying, or
printing a document, will generally require that someone examine and tag
all paragraphs and subparagraphs manually according to their visually
recognized structural type. This is tedious and time consuming, and
often an impracticable process for large documents.

SUMMARY OF THE INVENTION

In accordance with the present invention, a method of extracting
structure information from an electronic document includes the step of
identifying a structure type for each instance in the electronic
document by examining presentation attributes associated with each
instance. With such an arrangement, an unstructured electronic document
can be provided with structural tags.

Among the advantages of the invention are one or more of the following.
The invention enables an electronic document development system to
perform global operations on all paragraph and subparagraph instances by
structural type. Global operations include, but are not limited to,
format changes, searches, word and phrase replacements, and extractions.
The invention enables the electronic document development system to
perform operations on the structure of the electronic document (e.g.,
rearrange the hierarchy or subdivide the structure). The invention
permits an electronic document to be rearranged or divided according to
structural groupings of the document. The invention enables an
electronic document to be rearranged based on full sections and
identifiable units. The invention enables the document to be split based
on logical organization.

Other features and advantages of the invention will become apparent
from the following description and from the claims.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a schematic diagram of a computer platform suitable for
supporting a structural extractor system in accordance with the present
invention;

FIG. 2 is a flow chart of a method for converting and dividing an
electronic document for use by another documentation system;

FIG. 2a is an example of a table that maps source to destination tags;

FIG. 3 is a flow chart of a method that gathers statistical information
about an electronic document for use in the method of FIG. 2;

FIG. 4 is a flow chart of a method for mapping presentational
attributes in a document to structural types useful for the method of
FIG. 2;

FIG. 5 is a flow chart of a method for determining whether a paragraph
is a heading;

FIG. 6 is a flow chart of a method that sorts heading types;

FIG. 7 is a flow chart of a method that sorts headings according to
name;

FIG. 7a is a table that contains heading levels;

FIG. 8 is a diagram that shows a hierarchy of instances in a document;

FIG. 9 is a flow chart of a method that builds a hierarchical tree
structure for an electronic document;

FIG. 9a is a table that maps a link in a source electronic document to
a destination electronic document; and

FIG. 10 is a flowchart that shows a method that divides a tree
structure into subdocuments.

DESCRIPTION OF THE PREFERRED EMBODIMENTS

Referring to FIG. 1, a general purpose computer platform 100 suitable
for supporting an electronic document development system and including a
structural extractor 102 is shown. The platform (e.g., a personal
computer or workstation) includes a digital computer 104, a display 106,
a keyboard 108, a mouse 110 or other pointing device, and a mass storage
device 112 (e.g., hard disk drive, magneto-optical disk drive, or floppy
disk drive). The computer 104 is of conventional construction and
includes memory 120, a processor 122, and other customary components
(e.g., memory bus and peripheral bus).

An electronic document 130 stores information on a hard disk or other
computer readable medium such as a diskette. An electronic document is
viewable in a human perceptible representation on a computer display 132
or as a hardcopy printout through operation of a computer program.

A system capable of converting presentational attribute information in
a source electronic document to a pre-determined set of structural types
in a destination electronic document would eliminate human intervention
in the conversion process. Furthermore, such a system is useful for
structuring a document as a structural tree hierarchy. Such a system
would need to identify and map standard and user-defined types in the
source electronic document to appropriate structural types in the
destination electronic document.

Other processes that typically require human intervention are
rearranging parts of an electronic document and dividing an electronic
document into electronic sub-documents. A system capable of identifying
logical breaks, for example, at the beginning of a chapter or a section,
can automate the rearrangement and division processes. Such automated
processes must maintain links (e.g., hypertext links) to other
components in the electronic document.

Referring to FIG. 2, a system 200 that identifies formatting styles or
types receives as input a source electronic document and outputs one or
more electronic documents with structural types recognizable by a
destination system. The destination system may be the same or a
different system than the source system. The system examines an
electronic document and collects statistics about the paragraph
instances in the electronic document (step 202). Using this information,
if the source electronic document has original presentational attribute
information in the form of a named type, the system creates a tag table,
having at least two columns as shown in FIG. 2a, mapping each original
paragraph type in the source electronic document to the structural type
for the destination system (step 204). The tag table can contain
information that indicates if a paragraph having that type can separate
from the electronic document. Structural types serve as a basis for
building a tree structure (step 206) that represents the structural
organization of instances in the electronic document. The system can
optionally divide the electronic document into subdocuments (step 208).
Smaller files are easier to download and view using a World Wide Web
browser, for example. The system also can create output files (step 210)
with structural type tags that the destination system will recognize.

Referring to FIG. 3, the system gathers statistics on each paragraph.
The system can gather statistics while reading the source electronic
document during one or more passes. Statistics include the number of
instances of each original paragraph type (step 310), the total number
of lines for all instances having the same paragraph type (step 320),
and the number of groups of consecutive instances of a particular type
(step 330). Each group of consecutive instances of the original type is
referred to as a batch. From these statistics, the system computes the
average number of lines for all the instances for each original
paragraph type (step 340). The system also computes the average
batch-length for an original paragraph type by dividing the number of
instances of the same type by the number of batches of that type (step
350). These statistics are used in conjunction with the method described
in reference to FIG. 5.

A checklist containing factors is used to map paragraph and
subparagraph types in a source electronic document to structural types
in a destination electronic document. One factor is the combination of
existing style settings, such as relative character weights, relative
font sizes, italic characters, and indentations. Other factors include
the placement of paragraphs and subparagraphs in relationship to other
paragraphs and subparagraphs, numbering schemes, and repetitive
structures (e.g., bulleted lists).

The system maps each source type to a destination type. A mapping
between types can occur although the source type does not have all
characteristics of a destination type. However, the more characteristics
present, the higher the probability that the source type is a specific
destination type.

Referring to FIG. 4, the steps used to map paragraph presentational
attribute types in the source system to structural paragraph types in
the destination system are shown. The system gets the next original
paragraph type from a catalog that defines all original paragraph types
for the electronic document (step 402). The system determines whether
the paragraph type is a heading (step 404) or a list element (step 408).
If the paragraph is neither, the system identifies it as a default
paragraph type (step 410). Before mapping a type to a default paragraph
(step 410), the system can perform additional tests to determine whether
the paragraph is a footnote, bibliographic element, quoted passage, and
so forth.

To determine if the paragraph structural type is a list, the system
checks whether the paragraph has an automatically generated prefix,
which indicates that the paragraph is an ordered list. In a FrameMaker
source document, for example, if a format contains the characters "<"
and ">", the system tags the paragraph as an ordered list (step 416).
These characters enclose a code that specifies a quantity, such as a
number, that varies for each instance. Otherwise, the system identifies
the structural paragraph type as an unordered list (step 414).

Referring to FIG. 5, the system considers a number of factors to
determine if the structural paragraph type is a heading. The system
checks the placement of the paragraph (step 510) and if the placement is
on the side of a page within an area predominated by white space, the
original paragraph type maps to a heading (step 590). If the name of the
paragraph type begins with the letter "H" or "h", and ends with a number
(step 520), the paragraph type maps to a heading (step 590). If the name
of the paragraph type is "Title" (step 530), the paragraph type maps to
a heading (step 590). If the paragraph type has at least one instance
and there is at least one batch (step 540), the system uses the
statistics gathered to create a weighting factor. This factor is the
inverse of the average batch-length multiplied by the average number of
lines (step 550). If the paragraph type is automatically numbered, this
weighting factor is multiplied by the empirical constant 1.5 (step 560).
If the paragraph type is straddled (i.e., spans across multiple
columns), the weighting factor is multiplied by the empirical constant
1.5 (step 570). If the paragraph type is automatically numbered and is
straddled, the weighting factor increases twice. The system compares the
weighting factor to the constant 0.9 (step 580) and if the weighting
factor is greater, the paragraph type is classified as a heading (step
590). As the system assigns headings, it builds a heading table listing
each heading type, as shown in FIG. 7a.

After classifying all original paragraph types defined in the paragraph
catalog, the system considers all original subparagraph types defined in
the character catalog. In one embodiment, all character types are mapped
to a default character tag recognized by the destination system.
However, other embodiments may consider factors such as bounding
entities (e.g., quotation marks, underlines, and parentheses), bold
text, italics, and highlighting techniques.

For paragraphs and subparagraphs that do not have tags, the system can
analyze untagged paragraphs and subparagraphs, and assign appropriate
tags. For example, a series of one-line numbered paragraphs may be
tagged as an ordered list, or a quoted series of characters may be
tagged as a quoted passage.

Prior to building a tree structure (step 206) that represents the
hierarchical organization of instances in the document, the system
assigns a level to each structural paragraph type. An ordinary paragraph
is assigned a level of 0. Heading and list types are assigned levels
using a sorting technique. The system may use any sorting technique, for
example, a bubble sort, quick sort, or insertion sort, using the
comparison technique as shown in FIG. 6. The comparison technique
selects two items at a time from a heading table, as shown in FIG. 7a,
compares the two items, and assigns each a level. The sorting technique
makes additional passes until all items are ordered and assigned the
appropriate level.

As shown in FIG. 6, the technique that compares headings checks several
attributes for two heading types, A and B. The system gets two tags from
the heading mapping table (step 602), and first checks the names of the
heading tags (step 606).

Referring to FIG. 7, exemplary steps for comparing names are shown. If
the heading names are similar, the headings may be different levels. The
system checks whether name A and name B have the same number of
characters and end with a number (step 702). If all characters except
the last are identical (step 704), the last character in name A and the
last character in name B are compared (step 706). The heading with the
greater number is deemed the lesser heading (step 708 and step 710). The
system enters the level into the heading table, as shown in FIG. 7a. The
lower the heading number, the closer the level is to the root in the
tree structure. For example, the system assigns a lower level to a
heading named Heading2 than a heading named Heading1, and Heading1 is
closer to the root node.

Examples of other presentational attributes that the system checks, as
shown in FIG. 6, are whether a heading straddles columns (step 608), the
font sizes (step 610) and font weights (step 612) if the font is the
same family, whether a paragraph adjoins (i.e., runs into) the following
paragraph (step 614), the indentations (step 616), and the font sizes
and weights from different font families (step 618). The system checks
attributes with greater weights first. The result of the comparison is
that the A heading is more major (step 620) or the B heading is more
major (step 622).

The system assigns levels to list types in a similar manner as it
assigns levels to heading types. It uses a sorting technique that
compares names, numbering formats, indentations, and font sizes and
weights. The system can include additional comparisons during the sort
for other paragraph characteristics.

Referring to FIG. 8, a hierarchical structure represents structural
groups of paragraph instances in a document, where each instance is
represented by a node in the hierarchy. The system arranges the nodes
according to the logical flow of the document. For example, a document
may have four segments organized as chapters 802a-c and an appendix
802d, two of which have sections 804a-d. Each section may have one or
more paragraphs 806a-t.

As shown in FIG. 9, the system begins to build a structured
representation of a document by reading an unstructured electronic
document one paragraph at a time and getting the tag for the paragraph
(step 902). Each tag has a tag level that was assigned during the
sorting phase, and entered into a table such as shown in FIG. 7a. If the
tag level is zero (step 904), the paragraph is an ordinary paragraph. In
this case, the system appends the paragraph at the current level (step
906). The system handles the paragraph contents (step 908) as terminal
nodes. The contents include characters and links. A link causes the
system to add link information to a link destination table, as shown in
FIG. 9a, which is a table created during this tree-building phase. The
link destination table includes a source electronic document identifier,
an internal location of the link in the source and destination
electronic documents, and a pointer to a node in the tree structure.

If the tag level is not zero, the paragraph creates a section node,
which represents a branch such as a heading or beginning of a list. If
the tag level is less than or equal to the current level, the system
walks up the tree (step 914) until the tag level is greater than the
current level. The system generates a section node at this level (step
916). The section node begins a new branch of the tree. The node is
identified as a section node and the contents of the paragraph are the
first children of the node (e.g., the heading text).

A hierarchical organization enables the system to rearrange sections
and divide an electronic document into subfiles at specified branches in
the tree structure. Furthermore, a hierarchical organization provides a
means for identifying structural groups as branches of a tree. Branches
may represent lists, chapters, sections, subsections, and footnotes. A
document development system can also display the structural organization
of an electronic document and allow users to specify portions of the
electronic document as targets for specific operations. Such operations
may include format changes, searches, word and phrase replacements, and
extractions on all instances in one or more structural groups.

The system creates destination electronic documents by writing the
paragraph instances to files according to the tree structure. The system
walks the tree structure and writes the contents of each node, along
with the appropriate tags to the file associated with the node.

Section nodes represent paragraph instances where the system can divide
or rearrange the electronic document. Using the heading table, as shown
in FIG. 7a, specific nodes are identified as nodes where the system can
split the tree. In one embodiment, a tree structure may be subdivided at
every branch, as shown in FIG. 10. To divide a structured electronic
document in this way, the system traverses the tree, node by node (step
1010). The system checks for section nodes (step 1012). If the node is a
section node, the system creates a new electronic sub-document (step
1016) and generates a destination identifier (step 1018) that is entered
in the link destination table, as shown in FIG. 9a. The electronic
sub-documents, except the first, have links from the parent electronic
sub-document (step 1020). The link creates a natural flow from one
electronic sub-document to the next, for example, the link may be used
as a hypertext link. For each node, the system labels the node with a
destination identifier (step 1022).
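
By way of illustration only, the tree-building steps described with
reference to FIG. 9 might be sketched in Python as follows. The Node
class, the build_tree function, and the representation of each paragraph
as a (tag level, text) pair are assumptions made for this sketch and do
not appear in the description; link handling (step 908) is omitted.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    """One node of the structure tree (hypothetical layout)."""
    level: int                       # 0 for the root and for ordinary paragraphs
    text: str = ""
    is_section: bool = False
    parent: Optional["Node"] = field(default=None, repr=False, compare=False)
    children: List["Node"] = field(default_factory=list)

    def add(self, child: "Node") -> "Node":
        child.parent = self
        self.children.append(child)
        return child


def build_tree(paragraphs: List[Tuple[int, str]]) -> Node:
    """Build a tree from (tag level, text) pairs read in document order."""
    root = Node(level=0, text="ROOT", is_section=True)
    current = root
    for tag_level, text in paragraphs:
        if tag_level == 0:
            # Ordinary paragraph: append at the current level (step 906).
            current.add(Node(level=0, text=text))
            continue
        # Walk up until the tag level is greater than the current level
        # (step 914); the root is the upper bound.
        while current.parent is not None and tag_level <= current.level:
            current = current.parent
        # Generate a section node that begins a new branch (step 916);
        # the paragraph contents become its first child.
        section = current.add(Node(level=tag_level, is_section=True))
        section.add(Node(level=0, text=text))
        current = section
    return root


sample = [
    (1, "Chapter 1"), (0, "Intro"),
    (2, "Section 1.1"), (0, "Body"),
    (1, "Chapter 2"),
]
root = build_tree(sample)
```

In this sketch the two chapters become sibling section nodes under the
root, and "Section 1.1" becomes a section node nested inside the first
chapter.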

The system provides several ways for dividing an electronic
document. The system can divide the electronic
document at pre-defined levels. Levels can be designated by 
a user manually. Levels can be determined by pre-defined or 
automatically determined size limitations. 
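
As one possible sketch of division at a pre-defined level, the Python
fragment below starts a new sub-document at every section whose level
does not exceed a chosen limit and records a link in the parent
sub-document. The Section class, the split_at_level function, the
"docN" identifiers, and the "[link to ...]" placeholder text are all
hypothetical; size-based division is not shown.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Section:
    level: int                  # heading level; 0 is the document root
    title: str
    paragraphs: List[str] = field(default_factory=list)
    subsections: List["Section"] = field(default_factory=list)


def split_at_level(root: Section, max_level: int) -> Dict[str, List[str]]:
    """Return {sub-document id: lines}, cutting at sections up to max_level."""
    docs: Dict[str, List[str]] = {}

    def visit(section: Section, current_id: Optional[str]) -> None:
        if current_id is None or section.level <= max_level:
            # Start a new sub-document and link to it from its parent.
            new_id = f"doc{len(docs) + 1}"
            docs[new_id] = []
            if current_id is not None:
                docs[current_id].append(f"[link to {new_id}]")
            current_id = new_id
        docs[current_id].append(section.title)
        docs[current_id].extend(section.paragraphs)
        for sub in section.subsections:
            visit(sub, current_id)

    visit(root, None)
    return docs


manual = Section(0, "Manual", ["Preface"], [
    Section(1, "Chapter 1", ["p1"], [Section(2, "1.1", ["p2"])]),
    Section(1, "Chapter 2", ["p3"]),
])
by_chapter = split_at_level(manual, max_level=1)
by_section = split_at_level(manual, max_level=2)
```

With max_level=1 the sample yields three sub-documents (the front
matter and one per chapter); with max_level=2 the nested section is
also split out, giving four.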

A number of user interface techniques can be used to 
divide or rearrange a document. The system can display the 
tree structure in a user interface. A user can use a pointing 
device, such as a mouse, to specify areas in the tree to move 
or specify where the system should divide the electronic 
document. 

When a file is rearranged or divided, the system maintains 
all internal and external document links. The link destination 
table, as shown in FIG. 9a, that the system created during the 
tree-building phase makes this possible. When the system 
encounters a link node, the system resolves the link by 
finding the row in the link destination table containing the 
entry for the source electronic document, the link location, 
and the destination node having the destination identifier. 
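
The link resolution just described might be sketched as follows, under
the assumption that each row of the link destination table is keyed by
the source document identifier and the link location. The LinkRow and
LinkDestinationTable names, the use of an integer node_id in place of a
pointer to a tree node, and the sample file names are illustrative
assumptions only.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class LinkRow:
    source_doc: str                        # source electronic document identifier
    location: str                          # internal location of the link
    node_id: int                           # stands in for the pointer to a tree node
    destination_id: Optional[str] = None   # set once the tree has been divided


class LinkDestinationTable:
    def __init__(self) -> None:
        self._rows: Dict[Tuple[str, str], LinkRow] = {}

    def add_link(self, source_doc: str, location: str, node_id: int) -> None:
        """Record a link found while building the tree."""
        self._rows[(source_doc, location)] = LinkRow(source_doc, location, node_id)

    def label_node(self, node_id: int, destination_id: str) -> None:
        """Attach a destination identifier to every row that points at node_id."""
        for row in self._rows.values():
            if row.node_id == node_id:
                row.destination_id = destination_id

    def resolve(self, source_doc: str, location: str) -> Optional[str]:
        """Find the row for a link and return its destination identifier."""
        row = self._rows.get((source_doc, location))
        return row.destination_id if row else None


table = LinkDestinationTable()
table.add_link("manual.fm", "anchor-3", node_id=7)
table.label_node(7, "chapter2.html")
resolved = table.resolve("manual.fm", "anchor-3")
missing = table.resolve("manual.fm", "no-such-anchor")
```

Because the destination identifier is stored in the table rather than in
the link itself, a node can be moved to a different sub-document and
relabeled without rewriting the links that point to it.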

Other embodiments are within the scope of the following 
claims. For example, the order of performing steps of the 
invention may be changed by those skilled in the art and still 
achieve desirable results. Weighting factors may be 
changed. Additional steps and additional factors may be 
added. Steps and factors may also be omitted. Name check- 
ing can be extended to include headings with foreign names. 
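
As an illustration of how the weighting factors may be changed, the
heading test described with reference to FIG. 5 might be sketched in
Python with the constants exposed as parameters. The TypeStats class and
the is_heading function are hypothetical names. The sketch reads "the
inverse of the average batch-length multiplied by the average number of
lines" as 1 / (average batch length x average lines), which is an
interpretation, and it adds a guard against a zero line count that the
description does not mention.

```python
import re
from dataclasses import dataclass


@dataclass
class TypeStats:
    """Statistics gathered for one original paragraph type (FIG. 3)."""
    name: str
    instances: int = 0
    total_lines: int = 0
    batches: int = 0
    side_placement: bool = False
    auto_numbered: bool = False
    straddled: bool = False


def is_heading(t: TypeStats, numbered_boost: float = 1.5,
               straddle_boost: float = 1.5, threshold: float = 0.9) -> bool:
    if t.side_placement:                       # step 510
        return True
    if re.fullmatch(r"[Hh].*\d", t.name):      # step 520: "H"/"h" ... number
        return True
    if t.name == "Title":                      # step 530
        return True
    # Step 540, plus an assumed guard so the averages below are non-zero.
    if t.instances < 1 or t.batches < 1 or t.total_lines < 1:
        return False
    avg_lines = t.total_lines / t.instances    # step 340
    avg_batch = t.instances / t.batches        # step 350
    weight = 1.0 / (avg_batch * avg_lines)     # step 550 (assumed reading)
    if t.auto_numbered:
        weight *= numbered_boost               # step 560
    if t.straddled:
        weight *= straddle_boost               # step 570
    return weight > threshold                  # step 580


body = TypeStats("Body", instances=10, total_lines=50, batches=2)
caption = TypeStats("Caption", instances=4, total_lines=4, batches=4)
step = TypeStats("Step", instances=6, total_lines=6, batches=3,
                 auto_numbered=True)
banner = TypeStats("Step", instances=6, total_lines=6, batches=3,
                   auto_numbered=True, straddled=True)
```

Under this reading, a type whose instances are short and occur singly
scores near 1 and passes the 0.9 threshold, while long runs of
multi-line body text score far below it.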

What is claimed is: 

1. A computer-implemented method for inferring struc- 
ture information in an electronic document, comprising: 

identifying a plurality of paragraph types in a source 
electronic document; 

gathering statistics for the paragraph types in the source 
electronic document, wherein the statistics are based on 
a count of paragraph instances having the same one 
paragraph type assigned, a count of lines of paragraph 
instances having the same one paragraph type, and a 
count of batch length of one paragraph type; 

mapping each paragraph type to one of a plurality of 
structural types; and 

using the statistics for the paragraph types to determine 
the structural type for one paragraph type. 

2. The method of claim 1, wherein mapping one para- 
graph type to one structural type comprises comparing a 
name of the one paragraph type to a word connoting one of 
the plurality of structural types. 

3. The method of claim 1, further comprising: 
identifying a plurality of subparagraph types assigned to 

a plurality of characters in the source electronic docu- 
ment; and 

mapping each subparagraph type to a pre-defined char- 
acter type by examining a plurality of presentational 
attributes of the subparagraph type. 

4. A computer-implemented method for inferring struc- 
ture information in an electronic document, comprising: 

identifying a plurality of paragraph types in a source 

electronic document; 
mapping each paragraph type to one of a plurality of 

structural types, this mapping comprising: 

examining a paragraph placement for a first one of the 
paragraph types; 

comparing a count of paragraph instances to which the 
first one of the paragraph types is assigned to 0 and 
a count of batches for the first one of the paragraph 
types to 0 if the paragraph placement is not a side 
placement; 






examining a first character and a last character of a 
name of the first one of the paragraph types if the 
paragraph placement is not a side placement and the 
count of paragraph instances and the count of 
batches is 0; and 

comparing the name of the first one of the paragraph 
types to a word connoting a first one of the structural 
types if the first character and the last character of the 
name do not connote the first one of the structural types.

5. The method of claim 4, wherein mapping one para- 
graph type to one structural type further comprises: 

computing a weighting factor if the count of paragraph 
instances and the count of batches are greater than 0; 
and 

determining the probability that the name connotes the 
one structural type by comparing the weighting factor 
to a predetermined value. 

6. The method of claim 5, wherein computing the weight- 
ing factor comprises: 

setting the weighting factor to an inverse of an average 

batch length multiplied by an average number of lines; 
multiplying the weighting factor by 1.5 if the first one of 

the paragraph types is automatically numbered; 
multiplying the weighting factor by 1.5 if the first one of 

the paragraph types straddles multiple columns; 
comparing the weighting factor to 0.9; and 
mapping the first one of the paragraph types to a heading 

if the weighting factor exceeds 0.9. 

7. A computer-implemented method for constructing a 
hierarchical organization of paragraph instances from an 
unstructured electronic document, comprising: 

assigning one of a plurality of hierarchical levels to one of 
a plurality of structural types in an unstructured elec- 
tronic document, wherein this assigning comprises 
sorting the structural types that are a heading structural 
type by structural name, assigned font, and indentation 
specification; 

associating one of a plurality of paragraph instances in the 
unstructured electronic document with one of the plu- 
rality of structural types; and 

constructing a hierarchical organization of paragraph 
instances using the structural type with which each 
paragraph instance is associated and the hierarchical 
level assigned to the structural type. 

8. The method of claim 7 wherein sorting by structural 
name comprises: 

comparing a length of a first structural name to a length 
of a second structural name; 

comparing a last character in the first structural name and 
a last character in the second structural name if the first 
structural name and the second structural name are the 
same length, end with a number, and have identical 
characters except the last character; and 

designating a more major heading to the one of the first 
structural name and the second structural name having 
a greater last character. 

9. A computer-implemented method for constructing a 
hierarchical organization of paragraph instances from an 
unstructured electronic document, comprising: 

assigning one of a plurality of hierarchical levels to one of 
a plurality of structural types in an unstructured elec- 
tronic document, wherein this assigning comprises 
sorting each structural type that is a list structural type 
by structural name and indentation specifications; 

associating one of a plurality of paragraph instances in the 
unstructured electronic document with one of the plu- 
rality of structural types; and 
constructing a hierarchical organization of paragraph
instances using the structural type with which each 
paragraph instance is associated and the hierarchical 
level assigned to the structural type. 

10. A computer-implemented method for constructing a 
hierarchical organization of paragraph instances from an 
unstructured electronic document, comprising: 

assigning one of a plurality of hierarchical levels to one of 
a plurality of structural types in an unstructured elec- 
tronic document; 

associating one of a plurality of paragraph instances in the 
unstructured electronic document with one of the plu- 
rality of structural types; and 

constructing a hierarchical organization of paragraph 
instances using the structural type with which each 
paragraph instance is associated and the hierarchical 
level assigned to the structural type wherein this con- 
structing comprises: 

appending a paragraph instance to a current tier if the 
hierarchical level of the structural type with which 
the paragraph instance is associated is equal to 0; 

assigning the current tier to a parent level having a 
lesser tier value if the hierarchical level of the 
structural type with which the paragraph instance is 
associated is not 0 and is less than or equal to the 
current tier, until the hierarchical level of the struc- 
tural type with which the paragraph instance is 
associated is greater than the current tier; and 

generating a section node to begin a new branch of the 
hierarchical organization if the hierarchical level of
the structural type with which the paragraph instance 
is associated is not 0 and is greater than the current 
tier. 

11. A computer program for constructing a hierarchical 
organization of paragraph instances from an unstructured 
electronic document, comprising instructions operable to 
cause a computer to: 

associate one of a plurality of paragraph instances in an 
unstructured electronic document with one of the plu- 
rality of structural types; 
assign one of a plurality of hierarchical levels to one of a 

plurality of structural types; and 
construct a hierarchical organization of paragraph 
instances using the structural type with which each 
paragraph instance is associated and the hierarchical 
level assigned to the structural type, the instructions to 
construct a hierarchical organization of paragraph 
instances comprising instructions to: 
append a paragraph instance to a current tier if the 
hierarchical level of the structural type with which 
the paragraph instance is associated is equal to 0; 
assign the current tier to a parent level having a lesser 
tier value if the hierarchical level of the structural 
type with which the paragraph instance is associated 
is not 0 and is less than or equal to the current tier, 
until the hierarchical level of the structural type with 
which the paragraph instance is associated is greater 
than the current tier; and 
generate a section node to begin a new branch of the 
hierarchical organization if the hierarchical level of
the structural type with which the paragraph instance 
is associated is not 0 and is greater than the current 
tier. 
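
The tier-walking steps recited in claims 10 and 11 can be illustrated with a minimal sketch. It assumes each paragraph instance arrives as a (text, level) pair, where the level is the hierarchical level of its structural type and 0 marks a non-heading paragraph. The `Section` class, the `build_hierarchy` and `to_nested` helpers, and the choice to store the heading text inside the new section node are hypothetical and are not taken from the claims.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Sequence, Tuple, Union


@dataclass
class Section:
    """Hypothetical section node; tier 0 is the document root."""
    tier: int
    parent: Optional["Section"] = field(default=None, repr=False, compare=False)
    children: List[Union["Section", str]] = field(default_factory=list)


def build_hierarchy(paragraphs: Sequence[Tuple[str, int]]) -> Section:
    """Build a tree from (text, level) pairs given in document order."""
    root = Section(tier=0)
    current = root
    for text, level in paragraphs:
        if level == 0:
            # Level 0: append the paragraph to the current tier.
            current.children.append(text)
            continue
        # Level is not 0 and <= current tier: climb to parents with
        # lesser tier values until the level exceeds the current tier.
        while level <= current.tier and current.parent is not None:
            current = current.parent
        # Level is now greater than the current tier: begin a new branch.
        section = Section(tier=level, parent=current)
        current.children.append(section)
        section.children.append(text)  # assumption: heading text lives in its section
        current = section
    return root


def to_nested(node: Section) -> list:
    """Flatten the tree into nested lists for inspection."""
    return [c if isinstance(c, str) else to_nested(c) for c in node.children]
```

For example, a level-1 heading followed by two level-2 headings yields one top-level section containing two sibling subsections, and a later level-1 heading climbs back to the root before starting its own branch.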

12. A computer program for constructing a hierarchical 
organization of paragraph instances from an unstructured
electronic document, comprising instructions operable to 
cause a computer to: 
associate one of a plurality of paragraph instances in an
unstructured electronic document with one of the plu- 
rality of structural types; and 

assign one of a plurality of hierarchical levels to one of a 
plurality of structural types, the instructions to assign 
comprising instructions operable to cause a computer to 
sort the structural types that are a heading structural 
type by structural name, assigned font, and indentation 
specification. 

13. The product of claim 12, wherein instructions to sort 
by structural name comprise instructions to: 

compare a length of a first structural name to a length of 
a second structural name; 

compare a last character in the first structural name and a 
last character in the second structural name if the first 
structural name and the second structural name are the 
same length, end with a number, and have identical 
characters except the last character; and 

designate a more major heading to the one of the first 
structural name and the second structural name having 
a greater last character. 
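
The name comparison of claim 13 can be sketched as follows. The function name `more_major_by_name` and the convention of returning `None` when the rule does not apply are hypothetical. The direction of the comparison (the greater last character is designated more major) follows the claim wording literally.

```python
from typing import Optional


def more_major_by_name(first: str, second: str) -> Optional[str]:
    """Return the name designated as the more major heading, or None
    when the claim-13 conditions are not met."""
    # Compare the lengths of the two structural names.
    if not first or len(first) != len(second):
        return None
    # Both names must end with a number (here: a single digit character).
    if not (first[-1].isdigit() and second[-1].isdigit()):
        return None
    # All characters except the last must be identical.
    if first[:-1] != second[:-1]:
        return None
    if first[-1] == second[-1]:
        return None  # identical names: nothing to designate
    # Per the claim wording, the greater last character is more major.
    return first if first[-1] > second[-1] else second
```

When the conditions are not met, a caller would presumably fall back to the other sort keys named in claim 12 (assigned font and indentation specification).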

14. A computer program for constructing a hierarchical 
organization of paragraph instances from an unstructured 
electronic document, comprising instructions operable to 
cause a computer to: 

associate one of a plurality of paragraph instances in an 
unstructured electronic document with one of the plu- 
rality of structural types; and 

assign one of a plurality of hierarchical levels to one of a 
plurality of structural types, the instructions to assign 
comprising instructions operable to cause a computer to 
sort each structural type that is a list structural type by 
structural name and indentation specification. 

15. A computer program for inferring structure informa- 
tion in an electronic document, comprising instructions 
operable to cause a computer to: 

identify a plurality of paragraph types in a source elec- 
tronic document; 

gather statistics for the paragraph types in the source 
electronic document, wherein the statistics are based on 
a count of paragraph instances having the same one 
paragraph type assigned, a count of lines of paragraph 
instances having the same one paragraph type, and a 
count of batch length of one paragraph type; 

map each paragraph type to one of a plurality of structural 
types; and 

use the statistics for the paragraph types to determine the 
structural type for one paragraph type.
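
The statistics named in claim 15 can be gathered in a single pass, as in the sketch below. It assumes the document is available as a sequence of (paragraph type, line count) pairs in document order, and it assumes that a "batch" means a run of consecutive paragraph instances sharing one paragraph type. The function names and the dictionary layout are hypothetical.

```python
from itertools import groupby
from typing import Dict, Iterable, Tuple


def gather_statistics(instances: Iterable[Tuple[str, int]]) -> Dict[str, Dict[str, int]]:
    """Count instances, lines, and batches for each paragraph type."""
    stats: Dict[str, Dict[str, int]] = {}
    # groupby yields one group per run of consecutive equal paragraph types.
    for ptype, run in groupby(instances, key=lambda inst: inst[0]):
        members = list(run)
        entry = stats.setdefault(ptype, {"instances": 0, "lines": 0, "batches": 0})
        entry["instances"] += len(members)
        entry["lines"] += sum(line_count for _, line_count in members)
        entry["batches"] += 1
    return stats


def averages(entry: Dict[str, int]) -> Tuple[float, float]:
    """Return (average batch length, average number of lines per instance)."""
    return (entry["instances"] / entry["batches"],
            entry["lines"] / entry["instances"])
```

Under these assumptions, a type whose instances tend to appear alone and on a single line has averages near 1, which is the pattern one would expect of headings.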

16. The computer program of claim 15, wherein instruc- 
tions to map one paragraph type to one structural type 
comprise instructions to compare a name of the one para- 
graph type to a word connoting one of the plurality of 
structural types. 
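
The name comparison of claim 16 might look like the sketch below. The keyword table `CONNOTING_WORDS` is purely hypothetical, since the claim does not list the words used, and the case-insensitive substring match is likewise an assumption.

```python
from typing import Optional

# Hypothetical table of words connoting each structural type.
CONNOTING_WORDS = {
    "heading": ("head", "title"),
    "list": ("list", "bullet"),
}


def structural_type_from_name(paragraph_type_name: str) -> Optional[str]:
    """Return the structural type whose connoting word appears in the
    paragraph type name, or None when no word matches."""
    lowered = paragraph_type_name.lower()
    for structural_type, words in CONNOTING_WORDS.items():
        if any(word in lowered for word in words):
            return structural_type
    return None
```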

17. The computer program of claim 15, further compris- 
ing instructions to: 

identify a plurality of subparagraph types assigned to a 
plurality of characters in the source electronic docu- 
ment; and 

map each subparagraph type to a pre-defined character 
type by examining a plurality of presentational 
attributes of the subparagraph type. 

18. A computer program for inferring structure informa- 
tion in an electronic document, comprising instructions 
operable to cause a computer to: 

identify a plurality of paragraph types in a source elec- 
tronic document; 
map each paragraph type to one of a plurality of structural
types, the instructions to map comprising instructions 
to: 

examine a paragraph placement for a first one of the 
paragraph types; 

compare a count of paragraph instances to which the 
first one of the paragraph types is assigned to 0 and 
a count of batches for the first one of the paragraph 
types to 0 if the paragraph placement is not a side 
placement; 

examine a first character and a last character of a name 
of the first one of the paragraph types if the para- 
graph placement is not a side placement and the 
count of paragraph instances and the count of 
batches is 0; and 
compare the name of the first one of the paragraph 
types to a word connoting a first one of the structural 
types if the first character and the last character of the 
name do not connote the first one of the structural types.

19. The computer program of claim 18, wherein instruc- 
tions to map one paragraph type to one structural type 
comprise instructions to: 



compute a weighting factor if the count of paragraph 
instances and the count of batches are greater than 0; 
and 

determine the probability that the name connotes the one 
structural type by comparing the weighting factor to a 
predetermined value.

20. The computer program of claim 19, wherein instruc- 
tions to compute the weighting factor comprise instructions 
to: 

set the weighting factor to an inverse of an average batch 
length multiplied by an average number of lines; 

multiply the weighting factor by 1.5 if the first one of the 
paragraph types is automatically numbered; 

multiply the weighting factor by 1.5 if the first one of the 
paragraph types straddles multiple columns; 

compare the weighting factor to 0.9; and 

map the first one of the paragraph types to a heading if the 
weighting factor exceeds 0.9. 
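
The weighting computation of claims 19 and 20 can be written out as a short sketch. It reads "an inverse of an average batch length multiplied by an average number of lines" as 1 / (average batch length × average number of lines); that reading is an assumption, chosen because it gives short, isolated paragraphs a weight near 1. The function names are hypothetical. The 1.5 multipliers and the 0.9 threshold come from the claim.

```python
HEADING_THRESHOLD = 0.9  # predetermined value recited in claim 20


def heading_weight(avg_batch_length: float,
                   avg_lines: float,
                   auto_numbered: bool = False,
                   straddles_columns: bool = False) -> float:
    """Compute the weighting factor for one paragraph type."""
    if avg_batch_length <= 0 or avg_lines <= 0:
        # Claim 19 computes the factor only when the counts exceed 0.
        raise ValueError("averages must be positive")
    # Assumed reading: inverse of the product of the two averages.
    weight = 1.0 / (avg_batch_length * avg_lines)
    if auto_numbered:
        weight *= 1.5
    if straddles_columns:
        weight *= 1.5
    return weight


def is_heading(avg_batch_length: float,
               avg_lines: float,
               auto_numbered: bool = False,
               straddles_columns: bool = False) -> bool:
    """Map the paragraph type to a heading if the weight exceeds 0.9."""
    weight = heading_weight(avg_batch_length, avg_lines,
                            auto_numbered, straddles_columns)
    return weight > HEADING_THRESHOLD
```

Under this reading, a type that always appears alone on one line has a weight of 1.0 and is mapped to a heading. A type averaging two instances per batch has a weight of 0.5, which rises to 1.125 only if it is both automatically numbered and straddles multiple columns.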